Graphonological Levenshtein Edit Distance: Application for Automated Cognate Identification
نویسنده
چکیده
This paper presents a methodology for calculating a modified Levenshtein edit distance between character strings, and applies it to the task of automated cognate identification from nonparallel (comparable) corpora. This task is an important stage in developing MT systems and bilingual dictionaries beyond the coverage of traditionally used aligned parallel corpora, which can be used for finding translation equivalents for the ‘long tail’ in Zipfian distribution: low-frequency and usually unambiguous lexical items in closely-related languages (many of those often underresourced). Graphonological Levenshtein edit distance relies on editing hierarchical representations of phonological features for graphemes (graphonological representations) and improves on phonological edit distance proposed for measuring dialectological variation. Graphonological edit distance works directly with character strings and does not require an intermediate stage of phonological transcription, exploiting the advantages of historical and morphological principles of orthography, which are obscured if only phonetic principle is applied. Difficulties associated with plain feature representations (unstructured feature sets or vectors) are addressed by using linguistically-motivated feature hierarchy that restricts matching of lower-level graphonological features when higher-level features are not matched. The paper presents an evaluation of the graphonological edit distance in comparison with the traditional Levenshtein edit distance from the perspective of its usefulness for the task of automated cognate identification. It discusses the advantages of the proposed method, which can be used for morphology induction, for robust transliteration across different alphabets (Latin, Cyrillic, Arabic, etc.) and robust identification of words with non-standard or distorted spelling, e.g., in user-generated content on the web such as posts on social media, blogs and comments. Software for calculating the modified feature-based Levenshtein distance, and the corresponding graphonological feature representations (vectors and the hierarchies of graphemes’ features) are released on the author’s webpage: http://corpus.leeds.ac.uk/bogdan/phonologylevenshtein/. Features are currently available for Latin and Cyrillic alphabets and will be extended to other alphabets and languages.
منابع مشابه
Phonologically Informed Edit Distance Algorithms for Word Alignment with Low-Resource Languages
Edit distance is commonly used to relate cognates across languages. This technique is particularly relevant for the processing of lowresource languages because the sparse data from such a language can be significantly bolstered by connecting words in the lowresource language with cognates in a related, higher-resource language. We present three methods for weighting edit distance algorithms bas...
متن کاملComparing Clustering Algorithms for the Identification of Similar Pages in Web Applications
In this paper, we analyze some widely employed clustering algorithms to identify duplicated or cloned pages in web applications. Indeed, we consider an agglomerative hierarchical clustering algorithm, a divisive clustering algorithm, k-means partitional clustering algorithm, and a partitional competitive clustering algorithm, namely Winner Takes All (WTA). All the clustering algorithms take as ...
متن کاملLearning String Edit Distance
In many applications, it is necessary to determine the similarity of two strings. A widely-used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string edit distance. Our stochastic model allows us to learn the optimal string edit dis...
متن کاملForm similarity via Levenshtein distance between ortho-filtered logarithmic ruling-gap ratios
Geometric invariants are combined with edit distance to compare the ruling configuration of noisy filled-out forms. It is shown that gap-ratios used as features capture most of the ruling information of even low-resolution and poorly scanned form images, and that the edit distance is tolerant of missed and spurious rulings. No preprocessing is required and the potentially time-consuming string ...
متن کاملEdit Distance Based Encryption and Its Application
Edit distance, also known as Levenshtein distance, is a very useful tool to measure the similarity between two strings. It has been widely used in many applications such as natural language processing and bioinformatics. In this paper, we introduce a new type of fuzzy public key encryption called Edit Distance-based Encryption (EDE). In EDE, the encryptor can specify an alphabet string and a th...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2016